home *** CD-ROM | disk | FTP | other *** search
-
-
- Serial Indexer Overview
- Brewster Kahle
-
- The serial indexer is a simple inverted file system that is not very
- different from existing IR systems.
-
- DATABASE FILES
-
- The serial indexer parses files and creates an inverted file index made up
- of 7 files. For a database named "index" the files would be:
- * index.inv -- the "postings file" that is a term followed by a list of
- entries each of which describe where that word occurs in the original files.
- A posting is a weight, doc_number (see the doc file), and character_position.
- This file is indexed with the dictionary (dct) file. The terms are in
- alphabetical order.
- * index.doc -- this is a linear list of document-entries one for each
- document. A document can be a complete file or a piece of a file (such as
- mail files that are the concatentation of many messages, each message is a
- document). The information kept in each entry is:
- filename_id: position into the filename file (fn) of the filename
- for this document.
- headline_id: position in the headline file.
- start_character: position in the file where this document starts
- end_character: end position. 0 if complete file.
- document_length: in characters
- number_of_lines
- date: time_t
- * index.fn -- list of the filenames in the database with the write-date
- of the file and the type of the file. Type is a string. Indexed by the
- position in the file, so this file can not be edited after the index is built.
- * index.hl -- list of the headlines. Indexed by position.
- * index.dct -- dictionary file which is a 2 level b-tree. The first block
- is pointers to the every 1000th entry in the rest of the dictionary file.
- Each entry is a fixed-length record of the word with the position into the
- rest of the file. The rest of the file are blocks just like the first
- block, but each entry is the word plus the position of it in the inverted
- file (inv). The whole dictionary is in alphabetical order.
- * index.src -- source description that is used to access the database.
- It is also returned as a response to the "help" query. This file is not
- overwritten once it is created. Therefore database maintainers should edit
- this file to add a good desciption of what that database contains.
- * index.status -- only the ram based seeker uses this to describe itself
- and get parameters from the user.
-
-
- INDEXING
-
- A new index is built by parsing the input files, finding the words in it
- and creating the filename, headline, and doc tables at parse time. The
- words are then passed to routines defined in irext.h which define the
- frontend/backend boundary.
-
- The serial indexer creates intermediate inv files starting at inv0 then
- inv1 etc. These are created by accumulating the words in a memory
- hashtable. Each invN file is in alphabetical order, so they can be merged
- easily into one inv file. This merging is done logarithmically, so that it
- is fast. Unfortunately, it means that just before the merging is done, the
- index occupies twice the space that it will finally.
-
- On the final merge, a dictionary file is produced by dumping the positions
- of the start of terms in the final inverted file.
-
- Adding to an exising database is done by adding to the filename, headline,
- and doc tables as the parsing is done, but waiting to merge the new invN
- files into the old inv file until all the new merging is done.
- Theoretically it should be able to be done while always having the database
- usable.
-
-
- SEARCHING
-
- A query is parsed for its seed words and relevant documents and is passed
- to a function search_word (defined in irext.h). The backend does whatever
- it will do to search and set some state. Then best_hit is called
- iteratively to find the documents that most closely matched the query.
- These results are used to look up headlines and filenames to hand back to
- the user.
-
- The serial backend does this by loading the postings for a particular term
- and then adding it into a score array. The score array has an entry for
- each document. A document gets its score increased by containing terms in
- the query.
-
-
- Roughly the goals of this design were:
- Make a flexible platform for experimentation
- Portable
- Low search overhead for fast startup and lightweight access
- Usable in pieces for multiple search engines backends
- Fast searching
- Easy to implement
- Size of the resulting database was not a priority
-
-